Abstract

Chakraborty (2021) examines the relationship between COVID-19 incidence rates and the percentage of people with disabilities across U.S. counties, considering socio-demographic factors like race, ethnicity, poverty status, age, and sex. Using bivariate correlations and generalized estimating equation (GEE) models, the study finds significant positive associations between COVID-19 rates and socially vulnerable populations. This reproduction study aims to verify Chakraborty’s findings for policy, research, and educational purposes by replicating all analyses, including county-level COVID-19 distributions, statistical correlations, and GEE models. The data and code are publicly available on GitHub, ensuring transparency and accessibility for further study.

Keywords

COVID-19, Race, Disability

Study Design

This is a reproduction study to understand the links between race/ethnicity and disability and COVID-19 rates.

Packages

packages <- c(
  "tidycensus", "tidyverse", "downloader", "sf", "classInt", "readr",
  "here", "s2", "pastecs", "tmap", "SpatialEpi", "svDialogs",
  "geepack", "knitr", "kableExtra", "foreign", "broom", "dotwhisker", "dotenv"
)

# load and install required packages
if(!require(groundhog)){
  install.packages("groundhog")
  require(groundhog)
}
## Loading required package: groundhog
## groundhog says: No default repository found, setting to 'http://cran.r-project.org/'
## Attached: 'Groundhog' (Version: 3.2.2)
## Tips and troubleshooting: https://groundhogR.com
if(!require(here)){
  install.packages("here")
  require(here)
}
## Loading required package: here
## here() starts at /Users/gushoward/Downloads/GITHUB/chakraborty-rpr1
groundhog.day <- "2025-03-08"
groundhog.library(packages, groundhog.day)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
## 
## 
## Attaching package: 'pastecs'
## 
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## 
## Loading required package: sp
## 
## 
## Attaching package: 'kableExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
## 
## 
## Successfully attached 'tidycensus_1.7.1'
## 
## Successfully attached 'tidyverse_2.0.0'
## 
## Successfully attached 'downloader_0.4'
## 
## Successfully attached 'sf_1.0-19'
## 
## Successfully attached 'classInt_0.4-11'
## 
## Successfully attached 'readr_2.1.5'
## 
## Had already  attached 'here_1.0.1'
## 
## Successfully attached 's2_1.1.7'
## 
## Successfully attached 'pastecs_1.4.2'
## 
## Successfully attached 'tmap_4.0'
## 
## Successfully attached 'SpatialEpi_1.2.8'
## 
## Successfully attached 'svDialogs_1.1.0'
## 
## Successfully attached 'geepack_1.3.12'
## 
## Successfully attached 'knitr_1.49'
## 
## Successfully attached 'kableExtra_1.4.0'
## 
## Successfully attached 'foreign_0.8-88'
## 
## Successfully attached 'broom_1.0.7'
## 
## Successfully attached 'dotwhisker_0.8.3'
## 
## Successfully attached 'dotenv_1.0.3'

DATA

Study metadata

Key words: COVID-19, Disability, Race/ethnicity, Poverty, Reproducibility Subject: Social and Behavioral Sciences: Geography: Geographic Information Sciences Spatial Coverage: Continental United States Spatial Resolution: US Counties Spatial Reference System: Contiguous USA Albers Equal Area projection EPSG:5070 Temporal Coverage: From 1/22/2020 to 8/1/2020 Temporal Resolution: Rates were collected as one temporal unit.

ACS data

The American Community Survey (ACS) five-year estimate (2014-2018) variables used in the study are outlined in the table below. Details on ACS data collection can be found at https://www.census.gov/topics/health/disability/guidance/data-collection-acs.html

 # Query disability demographic data with geographic boundaries

acs <- get_acs(geography = "county",
  table = "S1810",
  year = 2018,
  output = "wide",
  cache_table = TRUE,
  geometry = TRUE,
  keep_geo_vars = TRUE
)

# Query poverty and disability data
acs_pov <- get_acs(geography = "county",
  table = "C18130",
  year = 2018,
  output = "wide",
  cache_table = TRUE
)

# Query state geographic data
state <- get_acs(geography = "state",
  year = 2018,
  variables = c("B01001_001"),
  geometry = TRUE,
  keep_geo_vars = TRUE
)

# Save query results
saveRDS(acs, here("data", "raw", "public", "acs.RDS"))
saveRDS(acs_pov, here("data", "raw", "public", "acs_pov.RDS"))
saveRDS(state, here("data", "raw", "public", "state.RDS"))
acs <- readRDS(here("data", "raw", "public", "acs.RDS"))
acs_pov <- readRDS(here("data", "raw", "public", "acs_pov.RDS"))
state <- readRDS(here("data", "raw", "public", "state.RDS"))

Data Transformations

Initial Study does not include Alaska, Hawaii, or Puerto Rico. Data on people with disabilities in poverty is derived from a different census table (C18130) than data on people with disabilities and age, race, ethnicity, age, and biological sex (S1810).

# transform coordinate system and fix geometries
acs <- filter(acs, !STATEFP %in% c("02", "15", "72")) %>%
  st_transform(5070) %>%
  st_make_valid()

# Remove Alaska, Hawaii & Puerto Rico,
state <- filter(state, !STATEFP %in% c("02", "15", "72")) %>%
  st_transform(5070)

# Join poverty data to disability data
acs <- acs %>% left_join(acs_pov, by = "GEOID")
rm(acs_pov)
acs_derived <- mutate(acs,
  dis_pct = S1810_C02_001E / S1810_C01_001E * 100,
  white_pct = S1810_C02_004E / S1810_C01_001E * 100,
  black_pct = S1810_C02_005E / S1810_C01_001E * 100,
  native_pct = S1810_C02_006E / S1810_C01_001E * 100,
  asian_pct = S1810_C02_007E / S1810_C01_001E * 100,
  other_pct =
    (S1810_C02_008E + S1810_C02_009E + S1810_C02_010E) / S1810_C01_001E * 100,
  non_hisp_white_pct = S1810_C02_011E / S1810_C01_001E * 100,
  hisp_pct = S1810_C02_012E / S1810_C01_001E * 100,
  non_hisp_non_white_pct =
    (S1810_C02_001E - S1810_C02_012E - S1810_C02_011E) / S1810_C01_001E * 100,
  bpov_pct = (C18130_004E + C18130_011E + C18130_018E) / C18130_001E * 100,
  apov_pct = (C18130_005E + C18130_012E + C18130_019E) / C18130_001E * 100,
  pct_5_17 = S1810_C02_014E / S1810_C01_001E * 100,
  pct_18_34 = S1810_C02_015E / S1810_C01_001E * 100,
  pct_35_64 = S1810_C02_016E / S1810_C01_001E * 100,
  pct_65_74 = S1810_C02_017E / S1810_C01_001E * 100,
  pct_75 = S1810_C02_018E / S1810_C01_001E * 100,
  male_pct = S1810_C02_002E / S1810_C01_001E * 100,
  female_pct = S1810_C02_003E / S1810_C01_001E * 100
)

# select only relevant geographic identifiers and derived percentages
acs_derived <- acs_derived %>%
  select(
    fips = GEOID,
    statefp = STATEFP,
    county = NAME.x,
    county_st = NAME,
    contains("pct")
  )

COVID Data

The data includes an estimate of the total population (POP_ESTIMA) and confirmed COVID-19 cases (Confirmed). The COVID-19 case data expresses cumulative count of reported COVID-19 from 1/22/2020 to 8/1/2020. Versions of the data can be found at the John Hopkins CCSE COVID-19 Data Repository (https://github.com/CSSEGISandData/COVID-19).

covid <- read_sf(here("data", "raw", "public", "covidcase080120.gpkg"))

# select and rename the fips code, population, cases, and x,y coordinates
covid <- select(covid,
  fips = FIPS,
  pop = POP_ESTIMA,
  cases = Confirmed,
  x = X, y = Y
)

covid_table <- covid %>%
  mutate(covid_rate = round(covid$cases / covid$pop * 100000, 2)) %>%
  st_drop_geometry()

Join COVID and ACS

# Join COVID incidence rate data to acs data
acs_covid <- acs_derived %>%
  left_join(covid_table, by = "fips")

# move covid_rate column prior to disability percentages
acs_covid <- acs_covid %>%
  select(fips, statefp, county, county_st, covid_rate, everything())

rm(acs, acs_derived, covid)

missing data

# county with missing data
filter(acs_covid, is.na(bpov_pct)) %>% st_drop_geometry() %>% kable()
fips statefp county county_st covid_rate dis_pct white_pct black_pct native_pct asian_pct other_pct non_hisp_white_pct hisp_pct non_hisp_non_white_pct bpov_pct apov_pct pct_5_17 pct_18_34 pct_35_64 pct_65_74 pct_75 male_pct female_pct pop cases x y
35039 35 Rio Arriba Rio Arriba County, New Mexico 751.17 16.06467 10.77458 0.038371 2.744807 0.038371 2.468536 2.355981 11.39619 2.312494 NA NA 0.3069682 1.25857 6.781439 3.391998 4.279648 8.556738 7.50793 39006 293 -106.6932 36.50962
# replace NA with 0
acs_covid <- acs_covid |> replace_na(list(bpov_pct = 0,
apov_pct = 0))
# tm_covid_rates <- tm_shape(acs_covid) +
#   tm_polygons("covid_rate",
#     title = "COVID-19 Cases per 100,000 people\n(22 January 2020 to 1 August 2020)",
#     style = "quantile",
#     border.alpha = .2,
#     lwd = 0.2,
#     palette = "Oranges",
#   ) +
#   tm_shape(state) +
#     tm_borders("grey", lwd = .5) +
#   tm_layout(
#     legend.position = c("left", "bottom"),
#     legend.title.size = 0.8,
#     legend.text.size = 0.5
#   )
# 
# tm_covid_rates

COVID Incidence Rate TMAP

# Set tmap mode explicitly if needed (e.g., for interactive maps)
# tmap_mode("plot")  # or "view"

tm_covid_rates <- 
  tm_shape(acs_covid) +
    tm_polygons(
      col = "covid_rate",
      title = "COVID-19 Cases per 100,000 people\n(22 Jan 2020 to 1 Aug 2020)",
      style = "quantile",
      border.alpha = 0.2,
      lwd = 0.2,
      palette = "Oranges"
    ) +
  tm_shape(state) +
    tm_borders(
      col = "grey",
      lwd = 0.5
    ) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )
## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values') to
##   'tm_scale_intervals(<HERE>)'
## [v3->v4] `tm_polygons()`: use 'fill' for the fill color of polygons/symbols
## (instead of 'col'), and 'col' for the outlines (instead of 'border.col').
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
tm_covid_rates
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.

Disability Map

# tm_disability_rates <- tm_shape(acs_covid) +
#   tm_polygons("dis_pct",
#     title = "Percent of People with Disability\n(ACS 2014-2018)",
#     style = "quantile",
#     col_alpha = .2,
#     lwd = 0.2,
#     palette = "Oranges"
#   ) +
#   tm_shape(state) +
#   tm_borders("grey", lwd = .5) +
#   tm_layout(
#     legend.position = c("left", "bottom"),
#     legend.title.size = 0.8,
#     legend.text.size = 0.5
#   )
# 
# tm_disability_rates

library(tmap)

# Set tmap mode explicitly if needed
# tmap_mode("plot")  # or "view"

tm_disability_rates <- 
  tm_shape(acs_covid) +
    tm_polygons(
      col = "dis_pct",
      title = "Percent of People with Disability\n(ACS 2014–2018)",
      style = "quantile",
      alpha = 1,
      lwd = 0.2,
      palette = "Oranges"
    ) +
  tm_shape(state) +
    tm_borders(
      col = "grey",
      lwd = 0.2
    ) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )

tm_disability_rates

Summary Stats

acs_covid_stats <- acs_covid %>%
  st_drop_geometry() %>%
  select(covid_rate, contains("pct")) %>%
  stat.desc(norm = TRUE) %>%
  round(2) %>%
  t() %>%
  as.data.frame() %>%
  select(min, max, mean, SD = std.dev, ShapiroWilk = normtest.W, p = normtest.p)

acs_covid_stats %>%
  kable(caption = "Reproduced Descriptive Statistics",
        align = "c") %>%
  column_spec(2:6, width_min = "5em") %>%
  column_spec(7, width_min = "2em") %>%
  kable_styling(full_width = FALSE)
Reproduced Descriptive Statistics
min max mean SD ShapiroWilk p
covid_rate 0.00 14257.17 966.90 1003.96 0.74 0
dis_pct 3.83 33.71 15.95 4.40 0.98 0
white_pct 0.85 33.26 13.55 4.63 0.98 0
black_pct 0.00 20.70 1.48 2.66 0.61 0
native_pct 0.00 13.74 0.28 0.94 0.28 0
asian_pct 0.00 3.45 0.09 0.18 0.51 0
other_pct 0.00 15.24 0.55 0.65 0.57 0
non_hisp_white_pct 0.10 33.16 12.84 4.81 0.99 0
hisp_pct 0.00 25.26 0.99 2.15 0.42 0
non_hisp_non_white_pct 0.00 20.93 2.13 2.75 0.70 0
bpov_pct 0.00 14.97 3.57 1.85 0.93 0
apov_pct 0.00 27.30 12.48 3.06 0.99 0
pct_5_17 0.00 5.08 1.03 0.48 0.95 0
pct_18_34 0.00 5.59 1.56 0.67 0.96 0
pct_35_64 1.01 18.36 6.35 2.30 0.96 0
pct_65_74 0.00 12.73 3.09 1.16 0.95 0
pct_75 0.00 11.13 3.87 1.19 0.97 0
male_pct 1.30 18.19 8.06 2.37 0.98 0
female_pct 1.91 19.94 7.90 2.26 0.98 0
# load original table 1 results
table1 <- read.csv(here("data", "raw", "public", "table1.csv"))

# subtract original results from reproduced results
(select(acs_covid_stats, min, max, mean, SD) -
  select(table1, min, max, mean, SD)) %>%
  kable(caption = "Descriptive Statistics Comparison",
        align = "c") %>%
  column_spec(2:5, width = "4em") %>%
  kable_styling(full_width = FALSE)
Descriptive Statistics Comparison
min max mean SD
covid_rate 0 0.17 -0.1 -0.04
dis_pct 0 0.00 0.0 0.00
white_pct 0 0.00 0.0 0.00
black_pct 0 0.00 0.0 0.00
native_pct 0 0.00 0.0 0.00
asian_pct 0 0.00 0.0 0.00
other_pct 0 0.00 0.0 0.00
non_hisp_white_pct 0 0.00 0.0 0.00
hisp_pct 0 0.00 0.0 0.00
non_hisp_non_white_pct 0 0.00 0.0 0.00
bpov_pct 0 0.00 0.0 0.00
apov_pct 0 0.00 0.0 0.00
pct_5_17 0 0.00 0.0 0.00
pct_18_34 0 0.00 0.0 0.00
pct_35_64 0 0.00 0.0 0.00
pct_65_74 0 0.00 0.0 0.00
pct_75 0 0.00 0.0 0.00
male_pct 0 0.00 0.0 0.00
female_pct 0 0.00 0.0 0.00
rm(acs_covid_stats)

Pearsons

df <- sum(!is.na(acs_covid$dis_pct)) - 2

pearsons_r <- acs_covid %>%
  select(where(is.numeric)) %>%
  st_drop_geometry() %>%
  cor(method = "pearson", use = "everything") %>%
  as.data.frame() %>%
  select(r = covid_rate) %>%
  mutate(
    t = abs(r) / sqrt((1 - r^2) / (df)),
    p = pt(t, df, lower.tail = FALSE)
  ) %>%
  round(3) %>%
  rownames_to_column("variable") %>%
  filter(variable != "covid_rate")

pearsons_r %>%
  kable(caption = "Reproduced Pearson's R",
        align = "c") %>%
  column_spec(2:4, width = "4em") %>%
  kable_styling(full_width = FALSE)
Reproduced Pearson’s R
variable r t p
dis_pct -0.060 3.350 0.000
white_pct -0.332 19.612 0.000
black_pct 0.460 28.847 0.000
native_pct 0.019 1.072 0.142
asian_pct 0.094 5.272 0.000
other_pct 0.026 1.460 0.072
non_hisp_white_pct -0.361 21.545 0.000
hisp_pct 0.119 6.686 0.000
non_hisp_non_white_pct 0.442 27.429 0.000
bpov_pct 0.106 5.914 0.000
apov_pct -0.151 8.513 0.000
pct_5_17 0.084 4.688 0.000
pct_18_34 0.063 3.493 0.000
pct_35_64 -0.008 0.460 0.323
pct_65_74 -0.091 5.113 0.000
pct_75 -0.186 10.541 0.000
male_pct -0.134 7.519 0.000
female_pct 0.023 1.305 0.096
pop 0.128 7.215 0.000
cases 0.209 11.891 0.000
x 0.099 5.540 0.000
y -0.412 25.195 0.000

Comparison of the reproduced Pearson’s r correlation coefficients to the original study’s Pearson’s r correlation coefficients. Stars indicates the significance level with two stars for p < 0.01 and one star for p < 0.05. Correlation difference rp_r_diff is calculated between the reproduction study rp_r and original study or_r as rp_r_diff = rp_r - or_r Direction difference rp_dir_diff is calculated as (rp_r > 0) - (or_r > 0), giving 0 if both coefficients have the same direction, 1 if the reproduction is positive and the original is negative, and -1 if the reproduction is negative but the original is positive.

# calculate number of significance stars at p < 0.01 and p < 0.05 levels.
pearsons_r <- mutate(pearsons_r, rp_stars = as.numeric(as.character(cut(p,
  breaks = c(-0.1, 0.01, 0.05, 1),
  labels = c(2, 1, 0)
))))

# join reproduction coefficients to original study coefficients
correlations <- table1 %>%
  filter(variable != "covid_rate") %>%
  select(variable, or_r = r, or_stars = stars) %>%
  left_join(select(pearsons_r, variable, rp_r = r, rp_stars), by = "variable")

# find difference between coefficient and stars
correlations <- correlations %>%
  bind_cols(rename_with(
    correlations[, 4:5] - correlations[, 2:3],
    ~ paste0(.x, "_diff")
  ))

# find coefficients with different directions
correlations <- correlations %>% mutate(rp_dir_diff = (rp_r > 0) - (or_r > 0))

correlations %>%
  kable(caption = "Compare reproduced and original Pearson's R",
        col.names = c("Variable", "R", "Sig. Level", "R", "Sig. Level", "R", "Sig. Level", "Direction"),
        align = "c") %>%
  kable_styling() %>%
  add_header_above(c(" " = 1, "Original" = 2, "Reproduced" = 2, "Difference" = 3))
Compare reproduced and original Pearson’s R
Original
Reproduced
Difference
Variable R Sig. Level R Sig. Level R Sig. Level Direction
dis_pct -0.056 2 -0.060 2 -0.004 0 0
white_pct -0.326 2 -0.332 2 -0.006 0 0
black_pct 0.456 2 0.460 2 0.004 0 0
native_pct 0.020 0 0.019 0 -0.001 0 0
asian_pct 0.097 2 0.094 2 -0.003 0 0
other_pct 0.028 0 0.026 0 -0.002 0 0
non_hisp_white_pct -0.355 2 -0.361 2 -0.006 0 0
hisp_pct 0.119 2 0.119 2 0.000 0 0
non_hisp_non_white_pct 0.439 2 0.442 2 0.003 0 0
bpov_pct 0.108 2 0.106 2 -0.002 0 0
apov_pct -0.146 2 -0.151 2 -0.005 0 0
pct_5_17 0.083 2 0.084 2 0.001 0 0
pct_18_34 0.066 1 0.063 2 -0.003 1 0
pct_35_64 -0.005 0 -0.008 0 -0.003 0 0
pct_65_74 -0.089 2 -0.091 2 -0.002 0 0
pct_75 -0.181 2 -0.186 2 -0.005 0 0
male_pct -0.131 2 -0.134 2 -0.003 0 0
female_pct 0.028 0 0.023 0 -0.005 0 0

Reproduction correlation coefficients varied slightly from the original study coefficients by +/- 0.006. All but one Pearson’s correlation coefficient was significant to the same level, and the exception was age 18 to 34. Counter-intuitively, the correlation coefficient was slightly closer to 0 but the p value was also found to be more significant, suggesting a difference in the estimation of t and/or p, or a typographical error. All of the coefficients had the same direction.

Unplanned Deviation for Reproduction: We should expect identical results for this correlation test, so we loaded the original author’s data from Aug1GEEdata.csv to re-test the statistic, calculated as unplanned_r below.

# load author-provided original data
original_gee <- read.csv(here("data", "raw", "public", "Aug1GEEdata.csv"))

# calculate correlation coefficients using original data
original_gee %>%
  select(Incidence, PerDisable, starts_with("PD")) %>%
  cor(method = "pearson", use = "everything") %>%
  as.data.frame() %>%
  rownames_to_column("or_variable") %>%
  filter(or_variable != "Incidence") %>%
  select(or_variable, unplanned_r = Incidence) %>%
  bind_cols(correlations[, 1:2]) %>%
  mutate(unplanned_r = round(unplanned_r, 3), diff = unplanned_r - or_r) %>%
  select(variable, unplanned_r, or_r, diff) %>%
  kable(caption = "Recalculation of Pearson's R with original data",
        align = "c",
        ) %>%
  kable_styling(full_width = FALSE)
Recalculation of Pearson’s R with original data
variable unplanned_r or_r diff
dis_pct -0.056 -0.056 0
white_pct -0.326 -0.326 0
black_pct 0.456 0.456 0
native_pct 0.020 0.020 0
asian_pct 0.097 0.097 0
other_pct 0.028 0.028 0
non_hisp_white_pct -0.355 -0.355 0
hisp_pct 0.119 0.119 0
non_hisp_non_white_pct 0.439 0.439 0
bpov_pct 0.108 0.108 0
apov_pct -0.146 -0.146 0
pct_5_17 0.083 0.083 0
pct_18_34 0.066 0.066 0
pct_35_64 -0.005 -0.005 0
pct_65_74 -0.089 -0.089 0
pct_75 -0.181 -0.181 0
male_pct -0.131 -0.131 0
female_pct 0.028 0.028 0

The author’s original data produced coefficients identical to the original publication! Is it possible that the data values are correct but have been reassigned / transposed to different counties?

Unplanned Deviation for Reproduction: Considering the precise bitwise reproduction of descriptive statistics and of correlation statistics from author-provided data, we decided to recalculate the COVID-19 incidence rate with author-provided case and population data for comparison to the author-provided incidence rate.

# recalculate Incidence Rate
original_gee <- original_gee %>%
  mutate(recalc_Incidence = round(Cases / Total_POP * 100000, 2))

# compare recalculation to author-provided original data and print any counties
# with inconsistent results
original_gee %>%
  filter(recalc_Incidence != Incidence) %>%
  select(FIPS = COUNTY_FIPS, State = ST_Name, County = Countyname, Population = Total_POP, Cases, OR_Incidence = Incidence, RPr_Incidence = recalc_Incidence) %>%
  mutate(Difference = RPr_Incidence - OR_Incidence) %>%
  kable(caption = "Counties with inconsistent COVID-19 incidence rate") %>%
  kable_styling(latex_options = "scale_down")
Counties with inconsistent COVID-19 incidence rate
FIPS State County Population Cases OR_Incidence RPr_Incidence Difference
1115 Alabama St. Clair 88690 1151 1349.52 1297.78 -51.74
1117 Alabama Shelby 215707 2911 1297.78 1349.52 51.74
5123 Arkansas St. Francis 25439 1112 704.16 4371.24 3667.08
5125 Arkansas Saline 121421 855 397.33 704.16 306.83
5127 Arkansas Scott 10319 41 314.15 397.33 83.18
5129 Arkansas Searcy 7958 25 1322.08 314.15 -1007.93
5131 Arkansas Sebastian 127753 1689 5420.39 1322.08 -4098.31
5133 Arkansas Sevier 17139 929 570.08 5420.39 4850.31
5135 Arkansas Sharp 17366 99 4371.24 570.08 -3801.16
8039 Colorado Elbert 26282 85 626.60 323.42 -303.18
8041 Colorado El Paso 713856 4473 323.42 626.60 303.18
8065 Colorado Lake 7824 70 349.85 894.68 544.83
8067 Colorado La Plata 56310 197 894.68 349.85 -544.83

We found that 13 counties had incorrect COVID-19 incidence scores, and the scores seem to be transposed from other counties, such that the overall descriptive statistics were accurate but the correlation coefficients were inaccurate. This finding implies that subsequent analyses using the COVID-19 Incidence rate will be slightly different and more accurate in this reproduction study than in the original study.

Unplanned deviation for reproduction: Original author’s Incidence data joined into our reproduction data frame so that we can later test for sensitivity to this error. Then report any counties for which the reproduced COVID incidence rate differs from the original author’s COVID incidence rate.

original_incidence <- original_gee %>%
  select(COUNTY_FIPS, or_incidence = Incidence) %>%
  mutate(fips =
           ifelse(COUNTY_FIPS >= 10000,
                  as.character(COUNTY_FIPS),
                  paste0("0", COUNTY_FIPS)
                  )
         )
  # calculates a text version of FIPS code for joining, while adding back
  # the leading '0' if the code was less than 10000

acs_covid <- acs_covid %>%
  left_join(original_incidence, by = "fips")

rm(original_incidence)

acs_covid %>%
  st_drop_geometry %>%
  filter(covid_rate != or_incidence) %>%
  arrange(fips) %>%
  select(county_st, covid_rate, or_incidence) %>%
  kable(caption = "Original incidence rate joined to reproduction data") %>% kable_styling()
Original incidence rate joined to reproduction data
county_st covid_rate or_incidence
St. Clair County, Alabama 1297.78 1349.52
Shelby County, Alabama 1349.52 1297.78
St. Francis County, Arkansas 4371.24 704.16
Saline County, Arkansas 704.16 397.33
Scott County, Arkansas 397.33 314.15
Searcy County, Arkansas 314.15 1322.08
Sebastian County, Arkansas 1322.08 5420.39
Sevier County, Arkansas 5420.39 570.08
Sharp County, Arkansas 570.08 4371.24
Elbert County, Colorado 323.42 626.60
El Paso County, Colorado 626.60 323.42
Lake County, Colorado 894.68 349.85
La Plata County, Colorado 349.85 894.68

The join highlights the same 13 counties with inconsistent incidence rates. This also confirms that our reproduced dependent variable is identical to the original dependent variable with the exception of these three counties.

Bivariate nonparametric correlation analysis

Unplanned Deviation for Reproduction: The dependent and independent variables in this study do not have normal distributions, as shown in the Shapiro-Wilk test results above. Therefore, we deviate from the original study to use the Spearman’s Rho non-parametric correlation test.

df <- sum(!is.na(acs_covid$dis_pct)) - 2

spearmans_rho <- acs_covid %>%
  select(where(is.numeric)) %>%
  st_drop_geometry() %>%
  cor(method = "spearman", use = "everything") %>%
  as.data.frame() %>%
  select(rho = covid_rate) %>%
  mutate(
    t = abs(rho) / sqrt((1 - rho^2) / (df)),
    p = pt(t, df, lower.tail = FALSE)
  ) %>%
  round(3) %>%
  rownames_to_column("variable") %>%
  filter(variable != "covid_rate")

CSpearman’s rho correlation coefficients comparee to the reproduced Pearson’s r correlation coefficients. Differences are calculated as Spearman’s Rho - Pearson’s R.

# calculate number of significance stars at p<0.01 and P<0.05 levels.
spearmans_rho <- mutate(spearmans_rho, rp_rho_stars = as.numeric(as.character(cut(p,
  breaks = c(-0.1, 0.01, 0.05, 1),
  labels = c(2, 1, 0)
))))

correlations <- correlations[, 1:8] %>%
  left_join(select(spearmans_rho, variable, rp_rho = rho, rp_rho_stars), by = "variable")

corrdiff <- select(correlations, starts_with("rp_rho")) -
  select(correlations, rp_r, rp_stars)

correlations <- correlations %>% bind_cols(rename_with(corrdiff, ~ paste0(.x, "_diff")))
rm(corrdiff)

correlations <- correlations %>% mutate(rp_rho_dir_diff = (rp_rho > 0) - (rp_r > 0))

correlations %>%
  select(variable, rp_r, rp_stars, starts_with("rp_rho")) %>%
  kable(col.names = c("Variable", "R", "Stars", "Rho", "Stars", "Rho - R", "Stars", "Direction"),
        align = "c") %>%
  #column_spec(2:6, width_min = "5em") %>%
  kable_styling() %>%
  add_header_above(c(" " = 1, "Pearson's" = 2, "Spearman's" = 2, "Difference" = 3))
Pearson’s
Spearman’s
Difference
Variable R Stars Rho Stars Rho - R Stars Direction
dis_pct -0.060 2 -0.113 2 -0.053 0 0
white_pct -0.332 2 -0.421 2 -0.089 0 0
black_pct 0.460 2 0.575 2 0.115 0 0
native_pct 0.019 0 -0.084 2 -0.103 2 -1
asian_pct 0.094 2 0.194 2 0.100 0 0
other_pct 0.026 0 0.104 2 0.078 2 0
non_hisp_white_pct -0.361 2 -0.454 2 -0.093 0 0
hisp_pct 0.119 2 0.231 2 0.112 0 0
non_hisp_non_white_pct 0.442 2 0.481 2 0.039 0 0
bpov_pct 0.106 2 0.062 2 -0.044 0 0
apov_pct -0.151 2 -0.205 2 -0.054 0 0
pct_5_17 0.084 2 0.079 2 -0.005 0 0
pct_18_34 0.063 2 0.034 1 -0.029 -1 0
pct_35_64 -0.008 0 -0.020 0 -0.012 0 0
pct_65_74 -0.091 2 -0.151 2 -0.060 0 0
pct_75 -0.186 2 -0.285 2 -0.099 0 0
male_pct -0.134 2 -0.201 2 -0.067 0 0
female_pct 0.023 0 -0.014 0 -0.037 0 -1

Three variables change significance levels, with Native American and Other races gaining significance and age 18-34 losing significance. Two correlations change direction, with both Native American race (illustrated in scatterplot below) and Female households switching from positive correlations to negative correlations. Instabilities between the parametric and non-parametic correlations arise from variables with very skewed distributions and/or weak correlations at the county level. Some difference may also be attributable to the 13 counties with data errors in the COVID-19 Incidence Rate. In such distributions, outlier observations have more weight in the parametric Person’s R test than in the non-parametric Spearman’s Rho test.

plot(acs_covid$native_pct,
  acs_covid$covid_rate,
  xlab = "Percent Native American",
  ylab = "COVID-19 Incidence",
  pch = 16,
  col = rgb(0, 0, 0, 0.05),
  cex.lab = 0.8,
  cex.axis = 0.5,
)
lines(abline(lm(acs_covid$covid_rate ~ acs_covid$native_pct)))

rm(spearmans_rho, pearsons_r, correlations, table1, df)

Kulldorff spatial scan statistic

Although this study did not involve major geographic transformations, it employed specific geographic grouping criteria for the generalized estimating equation (GEE) models, combining state boundaries with COVID-19 risk levels identified using the Kulldorff spatial scan statistic. This statistic, implemented in the SaTScan software, detects spatial clusters of high COVID-19 incidence using spherical great-circle distance calculations based on county centroid coordinates. The geographic coordinates (X and Y attributes) used for these calculations were sourced from the ACS dataset.

The Kulldorff scan statistic relies on Monte Carlo simulations to determine the statistical significance of clusters, leading to potentially varying results across runs. While the original manuscript specifies use of a Poisson model in SaTScan, other model parameters appear to follow the software’s defaults. These include discrete, spatial-only modeling (excluding temporal analysis), a maximum cluster size of 50% of the population at risk, GINI-optimized cluster selection, and non-overlapping secondary clusters. The “P-value Cutoff” setting did not appear in version 9.6, implying that only the default setting was available at the time of analysis.

SaTScan software can also output two versions of geographic data:

  • The col cluster polygon shapefile contains a circle for each cluster, where each polygon is a circle defined by the cluster center and radius. The attributes include a variable REL_RISK for cluster relative risk
  • The gis location point shapefile contains one point for each county in a cluster. The attributes include variables LOC_RR for local relative risk and CLU_RR for cluster relative risk

The SaTScan software implementation of the Kulldorff spatial scan statistic calculates two relative risk scores for locations:

  • Cluster relative risk is the incidence rate of the population within the cluster divided by the incidence rate of the population outside of the cluster. This is calculated as REL_RISK in the col cluster polygon shapefile and as CLU_RR in the gis location point shapefile.
  • Local relative risk is the incidence rate of population within a location divided by the incidence rate of the population outside of the location. This is calculated as LOC_RR in the gis location shapefile, and is not calculated in the col cluster polygon shapefile.

For the purposes of interpreting the spatial scan statistic, a location is a county centroid while a cluster is a collection of counties with high incidence rates, defined in the shape of a circle with a center location (a county centroid) and a radius.

The original study is not clear about using the cluster geographic data vs the location geographic data or the cluster relative risk vs local relative risk. However, The author-provided SatScan_results.txt results file indicates a geographic cluster file but no location file, and the author-provided Aug1GEEdata.csv data table contains a REL_RISK field but no CLU_RR field or LOC_RR field. This suggests that in the original study, the col polygon cluster shapefile and cluster relative risk were used to represent COVID-19 risk and define GEE clusters.

The spatial scan statistic is based on case counts and total population, and is therefore unaffected by errors in the COVID Incidence rate.

Planned deviation for reproduction: We opted to use the SpatialEpi package in R, selecting open source software with R integration over SatSCan software, which is free but not open. The Kulldorff spatial scan statistic model in SpatialEpi also supports a discrete Poisson spatial model, and uses the GINI coefficient to select secondary clusters with no geographical overlap that maximize the difference between locations inside of clusters and locations outside of clusters. We expected that this set of software options could reproduce identical results compared to SaTScan.

First, calculate the Kulldorff spatial scan statistic using SpatialEpi. Optionally, skip this code block due to long run times of more than 10 minutes.

start_time <- Sys.time()
covid_geo <- covid_table %>%
  select(x, y) %>%
  latlong2grid()
# latlong2grid creates approximate equidistant cylindrical grid
# could probably reproject to epsg 5070 and create table with x and y

# calculate expected cases with one strata
expected.cases <- expected(covid_table$pop, covid_table$cases, 1)

# Kulldorff spatial scan statistic
covid_kulldorff <- kulldorff(
  geo = covid_geo,
  cases = covid_table$cases,
  population = covid_table$pop,
  expected.cases = expected.cases,
  pop.upper.bound = 0.5,
  n.simulations = 999,
  alpha.level = 0.05,
  plot = TRUE
)

print(
  paste(
    "Run time:",
    round(difftime(Sys.time(), start_time, units = "mins"), 2),
    "minutes"
  ),
  quote = FALSE
)
rm(covid_geo, expected.cases, start_time)
# save results in a file appended with the current date
saveRDS(covid_kulldorff,
  file = here("data", "derived", "public", paste0("covid_kulldorff_", Sys.Date(), ".RDS"))
)

Load pre-calculated Kulldorff spatial scan results.

# load pre-calculated Kulldorff results
# alternatively, modify the file name with an appended date to load a more current set of results
covid_kulldorff <- readRDS(
  here("data", "raw", "public", "covid_kulldorff.RDS")
)

Report Kulldorff spatial scan results.

print("Most likely cluster:", quote = FALSE)
## [1] Most likely cluster:
covid_kulldorff$most.likely.cluster
## $location.IDs.included
##  [1] 1824 1835 1797 1818 1825 1749 1854 1742 1837 1747 1838  280 1760 1846 1756
## 
## $population
## [1] 16949211
## 
## $number.of.cases
## [1] 469091
## 
## $expected.cases
## [1] 233805.6
## 
## $SMR
## [1] 2.006329
## 
## $log.likelihood.ratio
## [1] 97983.07
## 
## $monte.carlo.rank
## [1] 1
## 
## $p.value
## [1] 0.001
print(
  paste0(
    "Number of Secondary clusters: ",
    length(covid_kulldorff$secondary.clusters)
  ),
  quote = FALSE
)
## [1] Number of Secondary clusters: 134

Clusters include the county at the center of a cluster and all of the other counties within the cluster radius. FIPS code of the county at the center of each cluster as the unique cluster ID is used.

# list of primary cluster locations (counties)
cluster_locations <- covid_kulldorff$most.likely.cluster$location.IDs.included

# create data frame of clusters and
# calculate the clusterID as the first (center) county FIPS code
clusters <- covid_table[cluster_locations, "fips"] %>%
  mutate(clusterID = covid_table[[cluster_locations[1], "fips"]],
         likelihood = covid_kulldorff$most.likely.cluster$log.likelihood.ratio)

# Get a list of secondary clusters
secondary <- covid_kulldorff$secondary.clusters

# similarly add counties in each secondary cluster to the list of clusters
for (i in secondary) {
  cluster_locations <- i$location.IDs.included
  new_clusters <- covid_table[cluster_locations, "fips"] %>%
    mutate(clusterID = covid_table[[cluster_locations[1], "fips"]],
           likelihood = i$log.likelihood.ratio)
  clusters <- clusters %>% rbind(new_clusters)
}

rm(cluster_locations, secondary, i, new_clusters)

Map Kulldorff clusters

Unplanned deviation for reproduction: The original study does not include visualizations of the spatial structure and distribution of COVID-19 clusters.

Join the Kulldorff spatial scan cluster IDs to the acs_covid simple features.

Calculate a new field isCluster to identify counties in COVID-19 clusters. Distinguish between counties defining the center of a cluster from counties constituting other parts of a cluster by comparing the cluster ID (equivalent to the center county’s fips code) to the county fips code.

acs_covid <- acs_covid[, 1:which(colnames(acs_covid) == "or_incidence")] %>%
  left_join(clusters, by = "fips") %>%
  mutate(isCluster = case_when(
    clusterID == fips ~ "center of cluster",
    !is.na(clusterID) ~ "other part of cluster",
    .default = NA
  ))

Planned deviation for reproduction: Map the SpatialEpi cluster results.

tm_spatialepi_clusters <-
  tm_shape(state) +
    tm_fill("gray98") +
  tm_shape(acs_covid) +
  tm_polygons(col = "isCluster",
          palette = "-Oranges",
          popup.vars = c("fips", "clusterID"),
          colorNA = NULL,
          title = "SpatialEpi Kulldorff COVID-19 Clusters",
          border.col = "white",
          lwd = 0.2,
          border.alpha = 0.2) +
  tm_shape(state) +
    tm_borders("grey", lwd = 0.5) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )
## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
## the visual variable `fill` namely 'palette' (rename to 'values'), 'colorNA'
## (rename to 'value.na') to fill.scale = tm_scale(<HERE>).
## ℹ For small multiples, specify a 'tm_scale_' for each multiple, and put them in
##   a list: 'fill.scale = list(<scale1>, <scale2>, ...)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
# tm_spatialepi_clusters

Unplanned deviation for reproduction: Calculate local and cluster relative risk

total_pop <- sum(acs_covid$pop)
total_cases <- sum(acs_covid$cases)

acs_covid <- acs_covid %>%
  group_by(clusterID) %>%
  mutate(
    rr_cluster = ifelse(is.na(clusterID), NA,
      (sum(cases) / sum(pop)) / ((total_cases - sum(cases)) / (total_pop - sum(pop)))
    )
  ) %>%
  ungroup() %>%
  mutate(
    rr_loc = (cases / pop) / ((total_cases - cases) / (total_pop - pop))
  )

rm(total_pop, total_cases)

Classify relative risk on a scale from 1 to 6 (workflow step 8). Risk is classified according to this table:

Relative Risk Values Relative Risk Class
Outside of cluster 1
RR < 1 1
1 <= RR < 2 2
2 <= RR < 3 3
3 <= RR < 4 4
4 <= RR < 5 5
5 <= RR < 6 6

Counties falling outside of any cluster are assigned a score of 1.

# class breaks
breaks <- c(-Inf, 1, 2, 3, 4, 5, Inf)

acs_covid <- acs_covid %>%
  mutate(
    cluster_class = ifelse(is.na(clusterID), 1, cut(rr_cluster, breaks, labels = FALSE)),
    loc_class = cut(rr_loc, breaks, labels = FALSE)
  )

Map relative risk scores

Unplanned deviation for reproduction:

Spatial distribution of local relative risk score classifications.

# count the frequency of counties in each class and create labels
class_freq <- acs_covid %>% st_drop_geometry() %>% count(loc_class)
class_freq$qual <- ifelse(class_freq$n > 1, " counties", " county")
class_freq[1, ]$qual <- paste(class_freq[1, ]$qual, "at low risk")
class_freq[nrow(class_freq), ]$qual <- paste(class_freq[nrow(class_freq), ]$qual, "at high risk")
class_freq$label <- paste0(class_freq$loc_class,
                           " (",
                           class_freq$n,
                           class_freq$qual,
                           ")")

# Map Local Relative Risk scores
# tm_spatialepi_local_risk_class <- tm_shape(acs_covid) +
#   tm_polygons("loc_class",
#     title = "Local Relative Risk Class",
#     border.col = "white",
#     border.alpha = .2,
#     lwd = 0.2,
#     palette = "Oranges",
#     style = "cat",
#     labels = class_freq$label
#   ) +
#   tm_shape(state) +
#   tm_borders("grey", lwd = .5) +
#   tm_layout(
#     legend.position = c("left", "bottom"),
#     legend.title.size = 0.8,
#     legend.text.size = 0.5
#   )

library(tmap)

# Set tmap mode explicitly if needed
# tmap_mode("plot")  # or "view"

tm_spatialepi_local_risk_class <- 
  tm_shape(acs_covid) +
    tm_polygons(
      col = "loc_class",
      title = "Local Relative Risk Class",
      border.col = "white",
      border.alpha = 0.2,
      lwd = 0.2,
      palette = "Oranges",
      style = "cat",
      labels = class_freq$label
    ) +
  tm_shape(state) +
    tm_borders(
      col = "grey",
      lwd = 0.5
    ) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )
## [v3->v4] `tm_polygons()`: instead of `style = "cat"`, use fill.scale =
## `tm_scale_categorical()`.
## ℹ Migrate the argument(s) 'palette' (rename to 'values'), 'labels' to
##   'tm_scale_categorical(<HERE>)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
tm_spatialepi_local_risk_class
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.

rm(class_freq)

tm_spatialepi_local_risk_class
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.

Map of cluster relative risk scores for comparison. - Counties outside of clusters are assigned the lowest risk class of 1.

# count the frequency of counties in each class and create labels
class_freq <- acs_covid %>% st_drop_geometry() %>% count(cluster_class)
class_freq$qual <- ifelse(class_freq$n > 1, " counties", " county")
class_freq[1, ]$qual <- paste(class_freq[1, ]$qual, "at low risk")
class_freq[nrow(class_freq), ]$qual <- paste(class_freq[nrow(class_freq), ]$qual, "at high risk")
class_freq$label <- paste0(class_freq$cluster_class,
                           " (",
                           class_freq$n,
                           class_freq$qual,
                           ")")

# map cluster relative risk scores
# tm_spatialepi_cluster_risk_class <- tm_shape(acs_covid) +
#   tm_polygons("cluster_class",
#     title = "Cluster Relative Risk Class",
#     border.col = "white",
#     border.alpha = .2,
#     lwd = 0.2,
#     palette = "Oranges",
#     style = "cat",
#     labels = class_freq$label
#   ) +
#   tm_shape(state) +
#   tm_borders("grey", lwd = .5) +
#   tm_layout(
#     legend.position = c("left", "bottom"),
#     legend.title.size = 0.8,
#     legend.text.size = 0.5
#   )


library(tmap)

# Set tmap mode explicitly if needed
# tmap_mode("plot")  # or "view"

tm_spatialepi_cluster_risk_class <- 
  tm_shape(acs_covid) +
    tm_polygons(
      col = "cluster_class",
      title = "Cluster Relative Risk Class",
      border.col = "white",
      border.alpha = 0.2,
      lwd = 0.2,
      palette = "Oranges",
      style = "cat",
      labels = class_freq$label
    ) +
  tm_shape(state) +
    tm_borders(
      col = "grey",
      lwd = 0.5
    ) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )
## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "cat"`, use fill.scale =
## `tm_scale_categorical()`.
## ℹ Migrate the argument(s) 'palette' (rename to 'values'), 'labels' to
##   'tm_scale_categorical(<HERE>)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
tm_spatialepi_cluster_risk_class
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.

rm(class_freq)

tm_spatialepi_cluster_risk_class
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.

The larger clusters in the southeast have an averaging impact on the cluster risk scores when compared to smaller clusters in the region.

This effect is more pronounced for clusters with low compactness (e.g. the Southeast cluster stretched over the “black belt” region from Louisiana and Arkansas to Georgia) than clusters with higher compactness (e.g. New York City) because the circular shape of clusters includes more low-risk counties.

Compare clusters

The original study did not directly report any results from the Kulldorff spatial scan statistic. However, the Kulldorff cluster relative risk scores were combined with states to create clusters for GEE models, hereafter called “GEE clusters”. The original study reported 102 unique GEE clusters having a range of 1 to 245 counties in each cluster.

In order to compare results, we first create cluster IDs as combinations of the state ID and COVID relative risk class. The first clustering ID (State) and second clustering score (COVID relative risk class) were combined to form IDs for each unique combination of state and relative risk class. Then, we find the number of unique clusters and frequency counties per cluster in our reproduction study for comparison to the original study.

# calculate clusters
acs_covid <- acs_covid %>% mutate(
  rp_clusID = as.integer(statefp) * 10 + cluster_class
)

# summarize clustering results
cluster_summary <- acs_covid %>%
  filter(cases > 0) %>%
  st_drop_geometry() %>%
  count(rp_clusID)
cat(
  length(cluster_summary$n),
  "unique clusters based on spatialEpi CLUSTER relative risk\n"
)
## 111 unique clusters based on spatialEpi CLUSTER relative risk
summary(cluster_summary$n)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    7.00   27.56   50.50  159.00
rm(cluster_summary)

We failed to reproduce the same configuration of GEE clusters as the original study, finding 9 more clusters than the original study and a much smaller maximum cluster of 159 counties compared to 245 counties.

Reproduce Kulldorff spatial scan statistic in SaTScan

Unplanned deviation for reproduction: Upon failing to reproduce an identical number of GEE clusters using SpatialEpi in R, we reproduced the procedure in the free but not open SaTScan software, using the current software version 10.1. The input data files (case, Coordinates.geo, and Population.pop), and output data files (sat_scan_rpr.txt, sat_scan_rpr.col.shp, and sat_scan_rpr.gis.shp) are found in the data/derived/public/satscan directory. The sat_scan_rpr.txt file reports the model parameters used in addition to results.

# load author-provided data
author_col <- read.dbf(here("data", "raw", "public","SatScan_output.dbf")) %>%
  select(LOC_ID, or_rel_risk = REL_RISK)

# load SaTScan reproduced data
# Download satscan data
satscan_rpr_col <- read_sf(here("data", "derived","satscan", "sat_scan_rpr.col.shp"))

# how many observations?
cat(
  nrow(satscan_rpr_col),
  " reproduced relative risk observations\n",
  nrow(author_col),
  " author-provided relative risk observations\n",
  sep = ""
)
## 96 reproduced relative risk observations
## 96 author-provided relative risk observations
# join and compare how many observations are identical?
cat(
  satscan_rpr_col %>%
  full_join(author_col, by = "LOC_ID") %>%
  filter(REL_RISK == or_rel_risk & REL_RISK > 0) %>%
  nrow(),
  "reproduced relative risk values match the original author's relative risk values"
  )
## 96 reproduced relative risk values match the original author's relative risk values
rm(author_col)

Our SaTScan results exactly reproduced the author-provided SaTScan results data.

Map SaTScan spatial clusters

Join the SaTScan results to acs_covid for mapping and analysis.

# check if there are any duplicated counties
cat("Joining",
    length(satscan_rpr_col$LOC_ID),
    "records with",
    length(unique(satscan_rpr_col$LOC_ID)),
    "unique LOC_ID county values")
## Joining 96 records with 96 unique LOC_ID county values
# select important non-geographic columns
satscan_rpr_col_t <- satscan_rpr_col %>%
  st_drop_geometry() %>%
  select(fips = LOC_ID, GINI_CLUST, REL_RISK)

# join
acs_covid <- acs_covid[, 1:which(colnames(acs_covid) == "rp_clusID")] %>%
  left_join(satscan_rpr_col_t, by = "fips")

rm(satscan_rpr_col_t)

Unplanned deviation for reproduction: Visualize the spatial distribution of the author-provided Kulldorff COVID-19 Clusters.

# count frequencies of each cluster type
clus_counts <- satscan_rpr_col %>%
  st_drop_geometry() %>%
  group_by(GINI_CLUST) %>%
  summarize(n = n())

# create labels including frequencies in brackets
clus_labels <- c(paste0("Hierarchical (", clus_counts[1,2], ")"),
            paste0("GINI Optimized (", clus_counts[2,2], ")"))

# for clusters with only one county, erase the number of counties
satscan_rpr_col[which(satscan_rpr_col$NUMBER_LOC < 3), ]$NUMBER_LOC <- NA

gini_circle <- satscan_rpr_col %>% filter(GINI_CLUST == 'T')
hier_circle <- satscan_rpr_col %>% filter(GINI_CLUST == 'F')
## filter data so that nulls are removed ahead of time


tm_author_clusters <-
  tm_shape(state) +
    tm_fill("gray98") +
  tm_shape(acs_covid |> filter(!is.na(GINI_CLUST))) +
  tm_polygons(fill = "GINI_CLUST",
              fill.scale = tm_scale(
              labels = clus_labels,
              values = c("tomato", "thistle3")),
              col = "white",
              lwd = 0.5,
              popup.vars = c("fips", "clusterID"),
              fill.legend = tm_legend(title = "SaTScan Kulldorff COVID-19 Clusters\nCluster Centers")) +
  tm_shape(state) +
    tm_borders("grey", lwd = 0.5) +
  tm_shape(gini_circle) +
    tm_borders(col = "thistle4") +
  tm_shape(hier_circle) +
    tm_borders(col = "tomato") +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  ) 

tm_author_clusters

rm(gini_circle, hier_circle, clus_counts, clus_labels)

In the map above, clusters containing only one county have no visible circle. Clusters containing two counties are encircled, but have no label. Clusters containing three or more counties are encircled and labelled with the number of counties.

Above version of data only includes the 96 counties defining cluster centers, visualized with fill colors above. The data excludes all of the non-center counties in clusters with more than one county. The extent of these larger clusters is visualized by unfilled circles defined by cluster radii.

Additionally, the SaTScan software confusingly merges two sets of clusters in the results when the user uses the (default) option for GINI-optimized clusters. One set of results is a hierarchical non-overlapping set of clusters. These clusters are noted with GINI_CLUST = F in the results. The second set of results is a set of hierarchical non-overlapping clusters designed to maximize the GINI coefficient of inequality between counties within clusters and counties outside of clusters. These clusters are noted with GINI_CLUST = T in the results.

Merged together as they are, the two sets of secondary clusters overlap one another geographically, causing ambiguity in terms of which cluster-based relative risk score should be used at each location.

Unplanned deviation for reproduction: Can we also use these reproduced SaTScan results to exactly reproduce the author-reported frequency of original GEE classes and maximum counties per class? If the results match, it will confirm that the problems identified above have propagated through the original study analysis.

acs_covid <- acs_covid %>%
  mutate(
    ss_cluster_class = ifelse(is.na(REL_RISK), 1, cut(REL_RISK, breaks, labels = FALSE)),
    ss_clusID = as.integer(statefp) * 10 + ss_cluster_class
  )

cluster_summary <- acs_covid %>%
  filter(cases > 0) %>%
  st_drop_geometry() %>%
  count(ss_clusID)
cat(
  length(cluster_summary$n),
  "unique clusters based on spatialEpi CLUSTER relative risk\n"
)
## 102 unique clusters based on spatialEpi CLUSTER relative risk
summary(cluster_summary$n)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    3.00   29.99   58.75  245.00

Using SaTScan Kulldorff clusters, we have exactly reproduced the author-reported frequency of original GEE classes and maximum counties per class. We have confirmed that the original study used the cluster relative risk of the center county of each cluster, including both the hierarchical and GINI-optimized sets of clusters.

Compare SaTScan clusters to SpatialEpi clusters

Unplanned Deviation for Reanalysis: At this point it is clear that the best decision will be to shift from a reproduction study to a reanalysis study, intentionally altering methodological decisions to achieve a more valid outcome. We prefer to include all counties contained in each cluster, and to use only one set of non-overlapping clusters, as produced by the SpatialEpi algorithm.

Given the shifting goal, how sensitive is this study to the choice of computational environment for the Kulldorff scan statistics? To answer this question, we must load the local SaTSCan results inclusive of all counties within clusters, filter the results to focus on the standard hierarchical set of clusters, and compare the spatial distributions of the SaTScan and SpatialEpi results.

# load local SaTScan reproduced data
satscan_rpr_gis_t <- read_sf(
  here("data", "derived", "satscan", "sat_scan_rpr.gis.shp")
) %>%
  st_drop_geometry() %>%
  select(fips = LOC_ID, CLU_RR, GINI_CLUST)

# check if there are any duplicated counties
cat("SaTScan combined GIS output has",
    length(satscan_rpr_gis_t$fips),
    "records with",
    length(unique(satscan_rpr_gis_t$fips)),
    "unique county values\n")
## SaTScan combined GIS output has 1306 records with 922 unique county values
satscan_rpr_gini <- satscan_rpr_gis_t %>% filter(GINI_CLUST == "T") %>% select(fips, gini_rr = CLU_RR)
satscan_rpr_hier <- satscan_rpr_gis_t %>% filter(GINI_CLUST == "F") %>% select(fips, hier_rr = CLU_RR)

# check if there are any duplicated counties
cat("SaTScan Hierarchical clusters include",
    length(satscan_rpr_hier$fips),
    "records with",
    length(unique(satscan_rpr_hier$fips)),
    "unique county values\n")
## SaTScan Hierarchical clusters include 605 records with 605 unique county values
cat("SaTScan GINI-optimized clusters include",
    length(satscan_rpr_gini$fips),
    "records with",
    length(unique(satscan_rpr_gini$fips)),
    "unique county values")
## SaTScan GINI-optimized clusters include 701 records with 701 unique county values
acs_covid <- acs_covid[, 1:which(colnames(acs_covid) == "ss_clusID")] %>%
  left_join(satscan_rpr_gini, by = "fips") %>%
  left_join(satscan_rpr_hier, by = "fips")

rm(satscan_rpr_gis_t, satscan_rpr_gini, satscan_rpr_hier)

It was necessary to divide the Hierarchical clusters from the GINI clusters to avoid duplicates and geographic overlap.

Compare the SaTScan Hierarchical clusters to the SpatialEpi clusters.

# count frequencies of each cluster type
clus_counts <- satscan_rpr_col %>%
  st_drop_geometry() %>%
  group_by(GINI_CLUST) %>%
  summarize(n = n())

# create labels including frequencies in brackets
clus_labels <- c(paste0("Hierarchical (", clus_counts[1,2], ")"),
            paste0("GINI Optimized (", clus_counts[2,2], ")"))

# for clusters with only one county, erase the number of counties
satscan_rpr_col[which(satscan_rpr_col$NUMBER_LOC < 3), ]$NUMBER_LOC <- NA

gini_circle <- satscan_rpr_col %>% filter(GINI_CLUST == 'T')
hier_circle <- satscan_rpr_col %>% filter(GINI_CLUST == 'F')
hier_clusters <- acs_covid %>%
  mutate(xcluster = case_when(
    !is.na(hier_rr) & is.na(isCluster) ~ "SaTScan Hierarchical only",
    !is.na(hier_rr) & !is.na(isCluster) ~ "Both SaTScan and SpatialEpi",
    is.na(hier_rr) & !is.na(isCluster) ~ "SpatialEpi only",
    .default = NA
  )) %>%
  filter(!is.na(xcluster)) %>%
  group_by(xcluster) %>%
  mutate(xn = n()) %>%
  ungroup() %>%
  mutate(xcluster = paste0(xcluster, " (", xn, " counties)"))

tm_spatialepi_hier <-
  tm_shape(state) +
  tm_fill("gray98") +
  tm_shape(hier_clusters) +
  tm_polygons(
    col = "xcluster",
    palette = c("wheat", "tomato", "thistle3"),
    title = "SaTScan Hierarchical and SpatialEpi Clusters",
    border.col = "white",
    lwd = 0.2,
    border.alpha = 0.3
  ) +
  tm_shape(state) +
  tm_borders("grey", lwd = 0.5) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )

tm_spatialepi_hier

rm(hier_clusters)

The two methods only agree on the definition of the largest clusters in distant regions. Thereafter, SpatialEpi detects many secondary clusters in the vicinity of the largest ones, while SaTScan detects seven isolated and low-probability counties.

Compare the SaTScan GINI Optimized clusters to the SpatialEpi clusters.

gini_clusters <- acs_covid %>%
  mutate(xcluster = case_when(
    !is.na(gini_rr) & is.na(isCluster) ~ "SaTScan GINI optimized only",
    !is.na(gini_rr) & !is.na(isCluster) ~ "Both SaTScan and SpatialEpi",
    is.na(gini_rr) & !is.na(isCluster) ~ "SpatialEpi only",
    .default = NA
  )) %>%
  filter(!is.na(xcluster)) %>%
  group_by(xcluster) %>%
  mutate(xn = n()) %>%
  ungroup() %>%
  mutate(xcluster = paste0(xcluster, " (", xn, " counties)"))

tm_spatialepi_gini <-
  tm_shape(state) +
  tm_fill("gray98") +
  tm_shape(gini_clusters) +
  tm_polygons(
    col = "xcluster",
    palette = c("wheat", "tomato", "thistle3"),
    title = "SaTScan GINI Optimized and SpatialEpi Clusters",
    border.col = "white",
    lwd = 0.2,
    border.alpha = 0.3
  ) +
  tm_shape(state) +
  tm_borders("grey", lwd = 0.5) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )

tm_spatialepi_gini

rm(gini_clusters)
acs_covid <- acs_covid %>%
  mutate(
    gini_class = ifelse(is.na(gini_rr), 1, cut(gini_rr, breaks, labels = FALSE)),
    gini_clusID = as.integer(statefp) * 10 + gini_class
  )

table(acs_covid$cluster_class, acs_covid$gini_class) %>%
  kable(row.names = TRUE, caption = "COVID-19 Risk Class by County", align = "c") %>%
  column_spec(2:7, width = "3em") %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(2, bold = TRUE) %>%
  add_header_above(c("SpatialEpi" = 1, "SatScan" = 6)) %>%
  kable_styling(full_width = FALSE, row_label_position = "c")
COVID-19 Risk Class by County
SpatialEpi
SatScan
1 2 3 4 5 6
1 2092 78 0 0 0 0
2 306 528 9 0 0 0
3 7 6 69 1 0 0
4 1 3 0 5 0 0
5 1 0 0 0 1 0
6 0 0 0 0 0 1

END

#Discussion and unplanned deviations

Given the differences across the results and some of the data, our analysis diverged from the original study in a number of ways. Despite reproducing the satscan package and the author’s other data files, discrepancies with results were found. The satscan clusters, in certain cases, misrepresented a cluster, assigning it to a larger region when the original study may have perscribed the Covid risk to that of only a singular county. On the other hand, the summary statistics did align with those from the original study and many of the clusters were aligned. There is more agreement overall between SpatialEpi and SaTScan GINI Optimized clusters. The two algorithms agree the most for smaller and less significant clusters above the 95% confidence threshold. Because the SaTScan clusters are more limited in size, SaTScan detects several smaller clusters with gaps in place of the largest SpatialEpi clusters.